AITopics | audio waveform

Collaborating Authors

audio waveform

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

b782a3462ee9d566291cff148333ea9b-Paper-Conference.pdf

Neural Information Processing SystemsFeb-17-2026, 17:11:11 GMT

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Leisure & Entertainment (1.00)
Media > Music (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Add feedback

T ell What You Hear From What You See - Video to Audio Generation Through Text

Neural Information Processing SystemsOct-10-2025, 14:21:14 GMT

When the audio caption is provided as a prompt, V A TT achieves even more refined performance (with lowest KLD score of 1.41).

audio caption, caption, video, (12 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Leisure & Entertainment (1.00)
Media > Music (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Add feedback

MahaTTS: A Unified Framework for Multilingual Text-to-Speech Synthesis

Singh, Jaskaran, Chowdhury, Amartya Roy, Prabhakar, Raghav, W, Varshul C.

arXiv.org Artificial IntelligenceAug-21-2025

Current Text-to-Speech models pose a multilingual challenge, where most of the models traditionally focus on English and European languages, thereby hurting the potential to provide access to information to many more people. To address this gap, we introduce MahaTTS-v2 a Multilingual Multi-speaker Text-To-Speech (TTS) system that has excellent multilingual expressive capabilities in Indic languages. The model has been trained on around 20K hours of data specifically focused on Indian languages. Our approach leverages Wav2Vec2.0 tokens for semantic extraction, and a Language Model (LM) for text-to-semantic modeling. Additionally, we have used a Conditional Flow Model (CFM) for semantics to melspectogram generation. The experimental results indicate the effectiveness of the proposed approach over other frameworks. Our code is available at https://github.com/dubverse-ai/MahaTTSv2

artificial intelligence, arxiv preprint arxiv, natural language, (14 more...)

arXiv.org Artificial Intelligence

2508.14049

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

Tell What You Hear From What You See -- Video to Audio Generation Through Text

Liu, Xiulong, Su, Kun, Shlizerman, Eli

arXiv.org Artificial IntelligenceDec-28-2024

The content of visual and audio scenes is multi-faceted such that a video can be paired with various audio and vice-versa. Thereby, in video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While Video-to-Audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and optional textual description of the audio. Such a framework has two advantages: i) Video-to-Audio generation process can be refined and controlled via text which complements the context of visual information, and ii) The model can suggest what audio to generate for the video by generating audio captions. VATT consists of two key modules: VATT Converter, a LLM that is fine-tuned for instructions and includes a projection layer that maps video features to the LLM vector space; and VATT Audio, a transformer that generates audio tokens from visual frames and from optional text prompt using iterative parallel decoding. The audio tokens are converted to a waveform by pretrained neural codec. Experiments show that when VATT is compared to existing video-to-audio generation methods in objective metrics, it achieves competitive performance when the audio caption is not provided. When the audio caption is provided as a prompt, VATT achieves even more refined performance (lowest KLD score of 1.41). Furthermore, subjective studies show that VATT Audio has been chosen as preferred generated audio than audio generated by existing methods. VATT enables controllable video-to-audio generation through text as well as suggesting text prompts for videos through audio captions, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2411.05679

Country: North America > United States (0.67)

Genre: Research Report > Experimental Study (1.00)

Industry:

Leisure & Entertainment (1.00)
Media > Music (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models

Kang, Mintong, Xu, Chejian, Li, Bo

arXiv.org Artificial IntelligenceDec-11-2024

Recent advancements in large audio-language models (LALMs) have enabled speech-based user interactions, significantly enhancing user experience and accelerating the deployment of LALMs in real-world applications. However, ensuring the safety of LALMs is crucial to prevent risky outputs that may raise societal concerns or violate AI regulations. Despite the importance of this issue, research on jailbreaking LALMs remains limited due to their recent emergence and the additional technical challenges they present compared to attacks on DNNbased audio models. Specifically, the audio encoders in LALMs, which involve discretization operations, often lead to gradient shattering, hindering the effectiveness of attacks relying on gradient-based optimizations. The behavioral variability of LALMs further complicates the identification of effective (adversarial) optimization targets. Moreover, enforcing stealthiness constraints on adversarial audio waveforms introduces a reduced, non-convex feasible solution space, further intensifying the challenges of the optimization process. To overcome these challenges, we develop AdvWave, the first jailbreak framework against LALMs. We propose a dual-phase optimization method that addresses gradient shattering, enabling effective end-to-end gradient-based optimization. Additionally, we develop an adaptive adversarial target search algorithm that dynamically adjusts the adversarial optimization target based on the response patterns of LALMs for specific queries. To ensure that adversarial audio remains perceptually natural to human listeners, we design a classifier-guided optimization approach that generates adversarial noise resembling common urban sounds. Furthermore, we employ an iterative adversarial audio refinement technique to achieve near-perfect jailbreak success rates on black-box LALMs, requiring fewer than 30 queries per instance.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2412.08608

Country: North America > United States > Illinois (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Law > Statutes (0.66)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.99)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)

Add feedback

PALM: Few-Shot Prompt Learning for Audio Language Models

Hanif, Asif, Agro, Maha Tufail, Qazi, Mohammad Areeb, Aldarmaki, Hanan

arXiv.org Artificial IntelligenceSep-29-2024

Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks, which match features of audio waveforms with class-specific text prompt features, inspired by advancements in Vision-Language Models (VLMs). Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is either on par with or outperforms other approaches while being computationally less demanding. Code is available at https://asif-hanif.github.io/palm/

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2409.19806

Country:

Asia > China > Beijing > Beijing (0.05)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Finland > Pirkanmaa > Tampere (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

Lee, Junwon, Im, Jaekwon, Kim, Dabin, Nam, Juhan

arXiv.org Artificial IntelligenceAug-21-2024

Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. Systems lacking explicit temporal features suffer from poor controllability and alignment, while timestamp-based models require costly and subjective human annotation. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts (audio or text). RMS, a frame-level intensity envelope feature closely related to audio semantics, ensures high controllability and synchronization. The annotation-free self-supervised learning framework consists of two stages, Video2RMS and RMS2Sound, incorporating novel ideas including RMS discretization and RMS-ControlNet with a pretrained text-to-audio model. Our extensive evaluation shows that Video-Foley achieves state-of-the-art performance in audio-visual alignment and controllability for sound timing, intensity, timbre, and nuance. Code, model weights, and demonstrations are available on the accompanying website. (https://jnwnlee.github.io/video-foley-demo)

rms, video-foley, video2rms, (15 more...)

arXiv.org Artificial Intelligence

2408.11915

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix Factorization via Plastic Transformer

Liu, Xiaofeng, Xing, Fangxu, Stone, Maureen, Zhuo, Jiachen, Fels, Sidney, Prince, Jerry L., Fakhri, Georges El, Woo, Jonghye

arXiv.org Artificial IntelligenceSep-25-2023

The tongue's intricate 3D structure, comprising localized functional units, plays a crucial role in the production of speech. When measured using tagged MRI, these functional units exhibit cohesive displacements and derived quantities that facilitate the complex process of speech production. Non-negative matrix factorization-based approaches have been shown to estimate the functional units through motion features, yielding a set of building blocks and a corresponding weighting map. Investigating the link between weighting maps and speech acoustics can offer significant insights into the intricate process of speech production. To this end, in this work, we utilize two-dimensional spectrograms as a proxy representation, and develop an end-to-end deep learning framework for translating weighting maps to their corresponding audio waveforms. Our proposed plastic light transformer (PLT) framework is based on directional product relative position bias and single-level spatial pyramid pooling, thus enabling flexible processing of weighting maps with variable size to fixed-size spectrograms, without input information loss or dimension expansion. Additionally, our PLT framework efficiently models the global correlation of wide matrix input. To improve the realism of our generated spectrograms with relatively limited training samples, we apply pair-wise utterance consistency with Maximum Mean Discrepancy constraint and adversarial training. Experimental results on a dataset of 29 subjects speaking two utterances demonstrated that our framework is able to synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.

audio waveform, spectrogram, weighting map, (13 more...)

arXiv.org Artificial Intelligence

2309.14586

Country:

North America > United States > Maryland > Baltimore (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling

Nguyen, Bac, Cardinaux, Fabien, Uhlich, Stefan

arXiv.org Artificial IntelligenceMar-7-2023

Parallel text-to-speech (TTS) models have recently enabled fast and highly-natural speech synthesis. However, they typically require external alignment models, which are not necessarily optimized for the decoder as they are not jointly trained. In this paper, we propose a differentiable duration method for learning monotonic alignments between input and output sequences. Our method is based on a soft-duration mechanism that optimizes a stochastic process in expectation. Using this differentiable duration method, we introduce AutoTTS, a direct text-to-waveform speech synthesis model. AutoTTS enables high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration. Experimental results show that our model obtains competitive results while enjoying a much simpler training pipeline. Audio samples are available online.

artificial intelligence, autotts, speech synthesis, (16 more...)

arXiv.org Artificial Intelligence

2203.11049

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)

Add feedback

Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound

Lee, Seokjin, Kim, Minhan, Shin, Seunghyeon, Lee, Daeho, Jang, Inseon, Lim, Wootaek

arXiv.org Artificial IntelligenceNov-16-2022

Deep generative models for audio synthesis have recently been significantly improved. However, the task of modeling raw-waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. The RAVE method is based on the variational autoencoder and utilizes the two-stage training strategy. Unfortunately, the RAVE model is limited in reproducing wide-pitch polyphonic music sound. Therefore, to enhance the reconstruction performance, we adopt the pitch activation data as an auxiliary information to the RAVE model. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on multiple stimulus tests with hidden references and an anchor (MUSHRA) with the MAESTRO. The obtained results indicate that the proposed model exhibits a more significant performance and stability improvement than the conventional RAVE model.

artificial intelligence, machine learning, rave model, (17 more...)

arXiv.org Artificial Intelligence

2211.08715

Country: North America > Montserrat (0.04)

Genre: Research Report (0.64)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback